install.packages("package_name")
tidyverse
readxl
janitor
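As a concrete sketch, the setup for this session would look something like this (install once, then load in each new session):

```r
# install once per machine (commented out so it doesn't re-run every session)
# install.packages(c("tidyverse", "readxl", "janitor"))

# load the packages each session
library(tidyverse)
library(readxl)
library(janitor)
```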
Example datasets:
iris
Built-in dataset that includes petal and sepal measurements for three
iris species
ga_weather_data
Includes daily weather observations from a subset of weather stations in
Athens, GA and Atlanta, GA from April-June 2024
To follow along, start by running this code:
read_csv() + read_excel()
janitor package
Tidy data is the ideal structure for working with Tidyverse packages and functions
read_csv()
# no column names
my_data <- read_csv("ga_weather_data_apr2024.csv", col_names = FALSE)
# provide a vector of column names
my_data <- read_csv("ga_weather_data_apr2024.csv", col_names = c("station_id", "station_name", "date", "precip", "tmax", "tmin"))
# read multiple files with file names in a vector
my_data <- read_csv(c("ga_weather_data_apr2024.csv", "ga_weather_data_may2024.csv", "ga_weather_data_jun2024.csv"))
# skip first line
my_data <- read_csv("ga_weather_data_apr2024.csv", skip = 1)
# specify which values should be read in as NA
my_data <- read_csv("ga_weather_data_apr2024.csv", na = c("NA", "N/A", "na", ""))
readxl package:
read_excel()
Like read_csv(), the only required argument is a file name or path in quotes
# select a specific sheet in a workbook
# if you don't specify a sheet, it will read in the first one
my_data <- read_excel("ga_weather_data_apr2024.xlsx", sheet = "apr2024")
my_data <- read_excel("ga_weather_data_apr2024.xlsx", sheet = 1)
# specify data type for each column
my_data <- read_excel("ga_weather_data_apr2024.xlsx", col_types = c("text", "text", "date", "numeric", "numeric", "numeric"))
readxl package: read_excel() with multiple sheets
Use excel_sheets() to get a vector of sheet names
bind_rows() + bind_cols()
Each of these sheets contains the same variables (column names), but
with different data for different months.
bind_rows() stacks them on top of each other to make a
single dataset
Columns don’t have to be in the same order in each dataset -
bind_rows() will match columns as long as the names are
identical
The resulting dataset is longer - more rows
my_data_apr2024 <- read_excel("ga_weather_data.xlsx", sheet = "april2024")
my_data_may2024 <- read_excel("ga_weather_data.xlsx", sheet = "may2024")
my_data_jun2024 <- read_excel("ga_weather_data.xlsx", sheet = 3)
# .id argument creates a new identifier column to link each observation to its original sheet
my_data <- bind_rows(my_data_apr2024, my_data_may2024, my_data_jun2024, .id = "id")
bind_cols() does the same, but stacks each dataset
side-by-side
The resulting dataset is wider - more columns
Be careful - make sure rows from each data frame are all in
the same order
because R doesn’t match rows with bind_cols() like
it matches columns with bind_rows()
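A minimal bind_cols() sketch with made-up tibbles (the names and values here are illustrative, not from the weather data):

```r
library(dplyr)

stations <- tibble(station_id = c("A1", "B2"),
                   name       = c("ATHENS 3.2 NW", "ATLANTA 5.3 NE"))
readings <- tibble(tmax = c(75, 79),
                   tmin = c(61, 62))

# rows are matched by position only, so both tibbles must already be
# sorted the same way
combined <- bind_cols(stations, readings)
combined  # 2 rows, 4 columns
```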
GA weather data file
## [1] "april2024" "may2024" "june2024"
ga_weather_april <- read_excel("ga_weather_data.xlsx", sheet = "april2024") # or sheet = 1
ga_weather_may <- read_excel("ga_weather_data.xlsx", sheet = "may2024") # or sheet = 2
ga_weather_june <- read_excel("ga_weather_data.xlsx", sheet = "june2024") # or sheet = 3
ga_weather <- bind_rows(ga_weather_april, ga_weather_may, ga_weather_june)
glimpse()
Similar to print() in base R, but glimpse() shows one line per column. For example, glimpse(iris) produces:
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
janitor package:
clean_names()
Cleans up column names to make them easier to work with (e.g., no spaces, special characters, all lowercase, etc.)
janitor package:
clean_names()
Thinking back to the Tidyverse style guide and best practices, what things might we want to change about these column names?
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
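With the built-in iris names shown above, clean_names() lowercases everything and swaps the dots for underscores:

```r
library(janitor)

iris_clean <- clean_names(iris)
names(iris_clean)
## [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
```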
janitor package, cont.
Sometimes R doesn’t recognize that the first row of your data frame
contains column names instead of data
We can use row_to_names() to elevate a row (usually the top
one) to column names
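A small sketch of row_to_names(), using a made-up data frame whose first row holds the real column names:

```r
library(janitor)

raw <- data.frame(X1 = c("station", "USW00053863"),
                  X2 = c("tmax", "75"))

# promote row 1 to column names and drop it from the data
fixed <- row_to_names(raw, row_number = 1)
names(fixed)
## [1] "station" "tmax"
```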
Once you’ve discerned the meaning and reason behind NAs or missing
values in a dataset, here are a few options for handling them:
A somewhat extreme option is to use drop_na() to drop
all rows that have an NA in one or more specified columns
As we can see, this drastically changes the size of the dataset
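For illustration, drop_na() on a toy stand-in (values made up, not the real weather data):

```r
library(tidyr)

demo <- data.frame(TMAX = c(75, NA, 66),
                   TMIN = c(61, 62, NA))

drop_na(demo)        # keeps only the fully complete row
drop_na(demo, TMAX)  # only requires TMAX to be non-missing
```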
## Rows: 364
## Columns: 6
## $ STATION <chr> "USW00053863", "USW00053863", "USW00053863", "USW00053863", "U…
## $ NAME <chr> "ATLANTA DEKALB PEACHTREE AIRPORT, GA US", "ATLANTA DEKALB PEA…
## $ DATE <dttm> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-04-05, 2…
## $ PRCP <dbl> 0.00, 0.41, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.23, 0.10, 0.…
## $ TMAX <dbl> 75, 79, 66, 62, 61, 64, 72, 75, 64, 71, 75, 67, 76, 83, 83, 82…
## $ TMIN <dbl> 61, 62, 49, 42, 41, 36, 37, 50, 56, 58, 54, 49, 44, 47, 56, 56…
## Rows: 1,338
## Columns: 6
## $ STATION <chr> "USW00053863", "USW00053863", "USW00053863", "USW00053863", "U…
## $ NAME <chr> "ATLANTA DEKALB PEACHTREE AIRPORT, GA US", "ATLANTA DEKALB PEA…
## $ DATE <dttm> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-04-05, 2…
## $ PRCP <dbl> 0.00, 0.41, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.23, 0.10, 0.…
## $ TMAX <dbl> 75, 79, 66, 62, 61, 64, 72, 75, 64, 71, 75, 67, 76, 83, 83, 82…
## $ TMIN <dbl> 61, 62, 49, 42, 41, 36, 37, 50, 56, 58, 54, 49, 44, 47, 56, 56…
Another option is to use fill() to fill in NAs with the
value above or below it in that column
In the case of our ga_weather data, this isn’t ideal
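A toy fill() sketch (values made up) showing both fill directions:

```r
library(tidyr)

demo <- data.frame(TMAX = c(75, NA, 66))

fill(demo, TMAX, .direction = "down")  # the NA becomes 75 (value above)
fill(demo, TMAX, .direction = "up")    # the NA becomes 66 (value below)
```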
Yet another option is to use replace_na() to replace all
NAs with a specific value
In this case, I’ve replaced NAs in the TMAX column with 0 (again, not
ideal for these data)
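The replacement described above can be sketched on toy data like this:

```r
library(tidyr)

demo <- data.frame(TMAX = c(75, NA, 66))

# replace NAs in TMAX with 0 (again, a questionable choice for temperatures)
replace_na(demo, list(TMAX = 0))
```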
## Rows: 1,338
## Columns: 6
## $ STATION <chr> "USW00053863", "USW00053863", "USW00053863", "USW00053863", "U…
## $ NAME <chr> "ATLANTA DEKALB PEACHTREE AIRPORT, GA US", "ATLANTA DEKALB PEA…
## $ DATE <dttm> 2024-04-01, 2024-04-02, 2024-04-03, 2024-04-04, 2024-04-05, 2…
## $ PRCP <dbl> 0.00, 0.41, 0.30, 0.00, 0.00, 0.00, 0.00, 0.00, 0.23, 0.10, 0.…
## $ TMAX <dbl> 75, 79, 66, 62, 61, 64, 72, 75, 64, 71, 75, 67, 76, 83, 83, 82…
## $ TMIN <dbl> 61, 62, 49, 42, 41, 36, 37, 50, 56, 58, 54, 49, 44, 47, 56, 56…
## No variable names specified - using all columns.
## No duplicate combinations found of: STATION, NAME, DATE, PRCP, TMAX, TMIN
## # A tibble: 0 × 7
## # ℹ 7 variables: STATION <chr>, NAME <chr>, DATE <dttm>, PRCP <dbl>,
## # TMAX <dbl>, TMIN <dbl>, dupe_count <int>
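The duplicate check shown above is janitor's get_dupes(); a minimal sketch on made-up data:

```r
library(janitor)

demo <- data.frame(id = c(1, 1, 2),
                   x  = c("a", "a", "b"))

# with no columns named, get_dupes() compares all columns
get_dupes(demo)  # returns the two duplicated rows plus a dupe_count column
```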
pivot_wider() adds columns to a dataset, making it wider
weather_wide <- pivot_wider(ga_weather, id_cols = DATE, names_from = "NAME", values_from = PRCP)
weather_wide
## # A tibble: 91 × 18
## DATE ATLANTA DEKALB PEACHTREE AIRPORT…¹ ATLANTA 5.3 NE, GA U…²
## <dttm> <dbl> <dbl>
## 1 2024-04-01 00:00:00 0 NA
## 2 2024-04-02 00:00:00 0.41 NA
## 3 2024-04-03 00:00:00 0.3 1.08
## 4 2024-04-04 00:00:00 0 0
## 5 2024-04-05 00:00:00 0 NA
## 6 2024-04-06 00:00:00 0 NA
## 7 2024-04-07 00:00:00 0 0
## 8 2024-04-08 00:00:00 0 0
## 9 2024-04-09 00:00:00 0.23 0.38
## 10 2024-04-10 00:00:00 0.1 0.43
## # ℹ 81 more rows
## # ℹ abbreviated names: ¹`ATLANTA DEKALB PEACHTREE AIRPORT, GA US`,
## # ²`ATLANTA 5.3 NE, GA US`
## # ℹ 15 more variables: `ATHENS 3.8 WNW, GA US` <dbl>,
## # `ATHENS 4.7 ESE, GA US` <dbl>, `ATHENS 4.6 SE, GA US` <dbl>,
## # `ATHENS 3.2 NW, GA US` <dbl>, `ATHENS 4.5 SE, GA US` <dbl>,
## # `ATHENS 7.6 SE, GA US` <dbl>, `ATHENS 0.6 ENE, GA US` <dbl>, …
pivot_longer() adds rows to a dataset, making it longer
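The long table below was likely produced by pivoting the wide weather table back; here is a self-contained toy version of that call (column names invented):

```r
library(tidyr)

wide <- data.frame(DATE    = c("2024-04-01", "2024-04-02"),
                   ATHENS  = c(0, 0.3),
                   ATLANTA = c(NA, 1.08))

# one row per DATE x station combination
long <- pivot_longer(wide, cols = -DATE,
                     names_to = "station", values_to = "prcp")
long  # 4 rows, 3 columns
```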
## # A tibble: 1,547 × 3
## DATE station prcp
## <dttm> <chr> <dbl>
## 1 2024-04-01 00:00:00 ATLANTA DEKALB PEACHTREE AIRPORT, GA US 0
## 2 2024-04-01 00:00:00 ATLANTA 5.3 NE, GA US NA
## 3 2024-04-01 00:00:00 ATHENS 3.8 WNW, GA US 0
## 4 2024-04-01 00:00:00 ATHENS 4.7 ESE, GA US 0
## 5 2024-04-01 00:00:00 ATHENS 4.6 SE, GA US 0
## 6 2024-04-01 00:00:00 ATHENS 3.2 NW, GA US 0
## 7 2024-04-01 00:00:00 ATHENS 4.5 SE, GA US 0
## 8 2024-04-01 00:00:00 ATHENS 7.6 SE, GA US 0
## 9 2024-04-01 00:00:00 ATHENS 0.6 ENE, GA US 0
## 10 2024-04-01 00:00:00 ATLANTA FULTON CO AIRPORT, GA US 0
## # ℹ 1,537 more rows
Left hand side (LHS) is “piped” in as first argument to the function
on right hand side (RHS)
It’s like saying take this (LHS) and then do this (RHS)
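A tiny pipe example with the built-in iris data (using the magrittr pipe %>%, which the tidyverse loads for you):

```r
library(dplyr)

# nested call: read inside-out
glimpse(head(iris, 3))

# piped: take iris, and then keep the first 3 rows, and then glimpse it
iris %>%
  head(3) %>%
  glimpse()
```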
Let’s revisit an earlier exercise where we need to read in and
combine several Excel sheets
excel_sheets("ga_weather_data.xlsx")
ga_weather_april <- read_excel("ga_weather_data.xlsx", sheet = "april2024") # or sheet = 1
ga_weather_may <- read_excel("ga_weather_data.xlsx", sheet = "may2024") # or sheet = 2
ga_weather_june <- read_excel("ga_weather_data.xlsx", sheet = "june2024") # or sheet = 3
ga_weather <- bind_rows(ga_weather_april, ga_weather_may, ga_weather_june)
Using pipes:
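One piped way to wrap the whole read-and-stack workflow into a single helper, assuming the same workbook as above (a sketch; read_workbook is a name invented here, and it needs the file on disk to run):

```r
library(readxl)
library(purrr)
library(dplyr)

# read every sheet of a workbook and stack them into one data frame
read_workbook <- function(path) {
  path %>%
    excel_sheets() %>%                      # e.g. "april2024" "may2024" "june2024"
    set_names() %>%                         # name the vector by its own values
    map(~ read_excel(path, sheet = .x)) %>% # one data frame per sheet
    bind_rows(.id = "sheet")                # keep a column recording the source sheet
}

# usage (needs the workbook on disk):
# ga_weather <- read_workbook("ga_weather_data.xlsx")
```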
Evaluate column names, duplicates, and NAs or missing
values
Reshape the data so that there is a single column for each
date
[HINT: this will involve a pivot_longer: prcp, tmin, tmax and a pivot_wider: date]
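On a toy stand-in for the weather data, the hinted reshape looks like this (column names simplified, values made up):

```r
library(tidyr)

toy <- data.frame(date = c("d1", "d1", "d2", "d2"),
                  name = c("A", "B", "A", "B"),
                  prcp = c(0, 0.3, 0.1, 0),
                  tmax = c(75, 79, 66, 70),
                  tmin = c(61, 60, 49, 50))

# first lengthen: one row per station x date x measurement
long <- pivot_longer(toy, cols = c(prcp, tmax, tmin),
                     names_to = "measure", values_to = "value")

# then widen: a single column for each date
wide <- pivot_wider(long, names_from = date, values_from = value)
wide  # columns: name, measure, d1, d2
```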
Other resources:
Packages: readr, readxl, tidyr,
magrittr, janitor
Data import cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-import.pdf
Data tidying with tidyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/tidyr.pdf
Magrittr package: https://magrittr.tidyverse.org/index.html
Janitor package: https://sfirke.github.io/janitor/index.html